@cheese-head cheese-head commented Jan 27, 2026

Overview:

Add infrastructure to track and report block IDs that fail during asynchronous KV cache load operations in the KVBM connector. This enables vLLM to identify failed blocks and take corrective action (e.g., mark for recomputation).

Details:

  • Add block_ids field to WorkerTransferRequest protocol for tracking which blocks belong to each transfer operation
  • Add failed_block_ids field and get_block_ids_with_load_errors() method to KvConnectorWorker
  • Expose the method through Python bindings (connector_worker.py, dynamo_connector.py)
  • Add failure_tx/failure_rx channel between Scheduler and WorkerSchedulerClient to propagate transfer failures
  • Add drain_failures() method to collect pending failure notifications
  • Track request_id -> uuid -> block_ids mapping in worker to convert failure notifications to block IDs
  • Update handle_immediate_result() to send failure notifications when result.status.is_err()
  • Clean up tracking state when requests complete
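
The failure-notification path in the list above can be sketched with a standard-library channel. This is a minimal illustration rather than the PR's code: `FailureNotice`, `TransferResult`, `make_pair`, and the use of `std::sync::mpsc` are assumptions made to keep the example self-contained, and the real scheduler may use a different channel type and message shape.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical failure message: which request and which transfer operation failed.
#[derive(Debug, Clone, PartialEq)]
struct FailureNotice {
    request_id: String,
    uuid: u64,
}

/// Hypothetical stand-in for the result of an immediate transfer.
struct TransferResult {
    request_id: String,
    uuid: u64,
    status: Result<(), String>,
}

/// Scheduler side: owns the sending half (`failure_tx`).
struct Scheduler {
    failure_tx: Sender<FailureNotice>,
}

impl Scheduler {
    /// Forward a failure notice only when the result carries an error.
    fn handle_immediate_result(&self, result: TransferResult) {
        if result.status.is_err() {
            // Ignore send errors: the receiver may already be gone at shutdown.
            let _ = self.failure_tx.send(FailureNotice {
                request_id: result.request_id,
                uuid: result.uuid,
            });
        }
    }
}

/// Worker side: owns the receiving half (`failure_rx`).
struct WorkerSchedulerClient {
    failure_rx: Receiver<FailureNotice>,
}

impl WorkerSchedulerClient {
    /// Collect every pending failure notice without blocking.
    fn drain_failures(&self) -> Vec<FailureNotice> {
        self.failure_rx.try_iter().collect()
    }
}

/// Build a connected scheduler/client pair (assumed helper for this sketch).
fn make_pair() -> (Scheduler, WorkerSchedulerClient) {
    let (failure_tx, failure_rx) = channel();
    (Scheduler { failure_tx }, WorkerSchedulerClient { failure_rx })
}
```

Draining with a non-blocking iterator keeps the worker's polling path from stalling when no failures are pending; successful results never touch the channel.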

Where should the reviewer start?

  1. lib/llm/src/block_manager/connector/scheduler.rs - Core failure notification channel implementation
  2. lib/bindings/kvbm/src/block_manager/vllm/connector/worker.rs - Block ID tracking and error aggregation logic
  3. lib/llm/src/block_manager/connector/protocol.rs - Protocol changes (block_ids field)

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

  • New Features

    • End-to-end error tracking for block load failures with a public way to retrieve failed block IDs for diagnostics.
    • Transfer operations now include block identifiers to improve error correlation during transfers.
    • Scheduler and workers gain failure signaling so load failures are propagated and recorded.
  • Tests

    • Updated tests to accommodate the new failure-tracking fields and behaviors.

Add infrastructure to track block IDs that fail during async KV cache
loading, enabling vLLM to identify and handle failed blocks gracefully.

Changes:
- protocol.rs: Add block_ids field to WorkerTransferRequest
- scheduler.rs: Update tests with new block_ids field
- slot.rs: Populate block_ids for onboard/offload operations
- worker.rs: Add failed_block_ids tracking and get_block_ids_with_load_errors method
- connector_worker.py: Expose get_block_ids_with_load_errors to Python
- dynamo_connector.py: Implement KVConnectorBase_V1 interface method

Signed-off-by: Patrick Riel <priel@nvidia.com>
…acking

- Add failure_tx/failure_rx channel between Scheduler and WorkerSchedulerClient
- Add drain_failures() method to collect pending failure notifications
- Track request_id -> uuid -> block_ids mapping in KvConnectorWorker
- Populate failed_block_ids from scheduler failure notifications
- Clean up block_ids tracking when requests complete

Signed-off-by: Patrick Riel <priel@nvidia.com>
copy-pr-bot bot commented Jan 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the feat label Jan 27, 2026
@cheese-head cheese-head marked this pull request as ready for review January 27, 2026 00:39
@cheese-head cheese-head requested review from a team as code owners January 27, 2026 00:39
Signed-off-by: Patrick Riel <priel@nvidia.com>
coderabbitai bot commented Jan 27, 2026

Walkthrough

Adds end-to-end propagation and retrieval of block load failure IDs: transfer requests carry block_ids, scheduler signals failures, workers collect and expose failed IDs, and Python bindings provide get_block_ids_with_load_errors().

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Python connector API: lib/bindings/kvbm/python/kvbm/vllm_integration/connector/dynamo_connector.py, lib/bindings/kvbm/python/kvbm/vllm_integration/connector_worker.py | Added get_block_ids_with_load_errors() to DynamoConnector and KvConnectorWorker; method forwards/returns failed block IDs (returns empty set if no worker). |
| Rust worker tracking & bindings: lib/bindings/kvbm/src/block_manager/vllm/connector/worker.rs | Added per-request request_to_blocks mapping and failed_block_ids set; introduced process_pending_failures, cleanup_onboarding_request, and get_block_ids_with_load_errors(); exposed to Python via PyKvConnectorWorker. |
| Rust transfer protocol: lib/llm/src/block_manager/connector/protocol.rs | Added pub block_ids: Vec<usize> to WorkerTransferRequest with #[serde(default)] for error-tracking metadata. |
| Rust scheduler & failure signaling: lib/llm/src/block_manager/connector/scheduler.rs | Added failure_tx/failure_rx channels, drain_failures() on WorkerSchedulerClient, and failure-send logic in immediate-transfer handling; tests updated to include block_ids. |
| Rust leader transfer paths: lib/bindings/kvbm/src/block_manager/vllm/connector/leader/slot.rs | Capture and forward block_ids into WorkerTransferRequest during onboard/offload to enable downstream error correlation. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I nibble logs and follow tracks of bits,
I gather IDs of blocks that missed their hits.
I carry them home in a tidy heap,
Then hand them back so the makers can keep —
a garden whole where transfers fit.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately and concisely summarizes the main feature being added: a method to retrieve block IDs that encountered load errors during async operations. |
| Description check | ✅ Passed | The description is comprehensive and well-structured, following the template with clear overview, detailed technical changes, reviewer guidance, and related issue reference. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@lib/bindings/kvbm/src/block_manager/vllm/connector/worker.rs`:
- Around line 293-305: When enqueuing onboarding_operations we currently stash
request->blocks in request_to_blocks but later remove that mapping in
get_finished before we call get_block_ids_with_load_errors, which can drop the
failure->block correlation; fix by ensuring failures are drained/translated into
block IDs before cleanup: call get_block_ids_with_load_errors (or a shared
helper that drains scheduler failures and maps them to block IDs using
request_to_blocks) prior to removing the request's entry from request_to_blocks
in get_finished; factor the draining/mapping logic into a reusable helper (e.g.,
drain_and_map_load_failures) and use it from both the onboarding loop code paths
and the get_finished cleanup to preserve failure→block mappings.
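
The ordering concern in this comment can be illustrated with a small sketch: failures are translated into block IDs before the request's tracking entry is removed. The `WorkerState` type, the `(String, u64)` failure tuples, the `usize` block IDs, and the choice to clear the set on read are all assumptions for illustration; only the names `request_to_blocks`, `failed_block_ids`, `drain_and_map_load_failures`, and `get_block_ids_with_load_errors` come from the discussion above.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical worker state, reduced to the two collections discussed above.
#[derive(Default)]
struct WorkerState {
    /// request_id -> (operation uuid -> block IDs in that transfer)
    request_to_blocks: HashMap<String, HashMap<u64, Vec<usize>>>,
    /// Block IDs whose load failed and has not been reported yet.
    failed_block_ids: HashSet<usize>,
}

impl WorkerState {
    /// Translate pending (request_id, uuid) failures into block IDs.
    /// A failure with no matching tracking entry cannot be resolved and is skipped.
    fn drain_and_map_load_failures(&mut self, pending: Vec<(String, u64)>) {
        for (request_id, uuid) in pending {
            if let Some(blocks) = self
                .request_to_blocks
                .get(&request_id)
                .and_then(|ops| ops.get(&uuid))
            {
                self.failed_block_ids.extend(blocks.iter().copied());
            }
        }
    }

    /// Finish a request: map failures first, then drop the tracking entry.
    fn finish_request(&mut self, request_id: &str, pending: Vec<(String, u64)>) {
        self.drain_and_map_load_failures(pending);
        self.request_to_blocks.remove(request_id);
    }

    /// Hand the accumulated failures to the caller and reset the set
    /// (clearing on read is an assumption of this sketch).
    fn get_block_ids_with_load_errors(&mut self) -> HashSet<usize> {
        std::mem::take(&mut self.failed_block_ids)
    }
}
```

If the two statements in `finish_request` were swapped, the lookup in `drain_and_map_load_failures` would find nothing for that request and the failed blocks would go unreported, which is the loss of correlation the comment describes.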
🧹 Nitpick comments (1)
lib/llm/src/block_manager/connector/protocol.rs (1)

152-154: Confirm whether #[serde(default)] should mask missing block_ids.
Defaulting to empty accepts older senders but silently drops error tracking. If deployments are atomic and you want strictness, consider making the field required so deserialization fails fast when it’s absent.

💡 Possible tweak to keep the protocol explicit
-    #[serde(default)]
-    pub block_ids: Vec<usize>,
+    pub block_ids: Vec<usize>,
Based on learnings, keep the spec explicit if backward compatibility isn't required.

@cheese-head

@coderabbitai review

coderabbitai bot commented Jan 28, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Signed-off-by: Patrick Riel <priel@nvidia.com>
@cheese-head

@coderabbitai review

coderabbitai bot commented Jan 28, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@lib/bindings/kvbm/src/block_manager/vllm/connector/worker.rs`:
- Line 127: Remove the trailing whitespace on the affected line in
lib/bindings/kvbm/src/block_manager/vllm/connector/worker.rs (the line flagged
by CI around the worker module) so the file has no extraneous spaces at
end-of-line; you can delete the trailing space directly or run a whitespace
cleanup (editor trim trailing whitespace or rustfmt/trim tool) and re-run the
pre-commit hook to verify the CI error is resolved.
🧹 Nitpick comments (2)
lib/bindings/kvbm/src/block_manager/vllm/connector/worker.rs (2)

128-129: Track the TODO with an issue (or handle in this PR).

Leaving the “move this out of bindings” TODO without tracking risks it lingering.

Do you want me to propose a follow-up issue or a concrete refactor plan?


136-137: Avoid silent truncation when casting block IDs to u32.

If block_id can exceed u32::MAX, as u32 will truncate. Consider try_from (or an assertion) to make this safe.

♻️ Safer conversion
-                        for &block_id in block_ids {
-                            self.failed_block_ids.insert(block_id as u32);
-                        }
+                        for &block_id in block_ids {
+                            match u32::try_from(block_id) {
+                                Ok(block_id_u32) => {
+                                    self.failed_block_ids.insert(block_id_u32);
+                                }
+                                Err(_) => {
+                                    tracing::warn!(
+                                        request_id = %request_id,
+                                        operation_id = %uuid,
+                                        block_id,
+                                        "block_id exceeds u32::MAX; skipping"
+                                    );
+                                }
+                            }
+                        }
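
The difference between the two conversions can be seen in isolation with a short sketch. It uses `u64` inputs instead of `usize` so the behavior is the same on every platform, and `split_block_ids` is a hypothetical helper, not part of the PR.

```rust
use std::convert::TryFrom;

/// Convert block IDs to u32, separating out the ones that do not fit.
/// (Hypothetical helper; the PR's suggestion logs a warning and skips instead.)
fn split_block_ids(block_ids: &[u64]) -> (Vec<u32>, Vec<u64>) {
    let mut converted = Vec::new();
    let mut rejected = Vec::new();
    for &id in block_ids {
        match u32::try_from(id) {
            Ok(v) => converted.push(v),
            // Out of range for u32: keep the original value for reporting.
            Err(_) => rejected.push(id),
        }
    }
    (converted, rejected)
}
```

With an `as u32` cast, an ID of `u32::MAX as u64 + 6` wraps to `5` and would be recorded as a different, valid-looking block; `u32::try_from` returns an error for the same input, so the caller can decide whether to skip, log, or fail.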

failed_block_ids: HashSet::new(),
})
}

⚠️ Potential issue | 🟡 Minor

Remove trailing whitespace to fix the failing pre-commit hook.

Line 127 has trailing whitespace, which is already failing CI.
